Abstract

This paper presents a new measure of semantic 
similarity in an IS-A taxonomy, based on the 
notion of information content. Experimental 
evaluation suggests that the measure performs 
encouragingly well (a correlation of r = 0.79 
with a benchmark set of human similarity judg- 
ments, with an upper bound of r = 0.90 for hu- 
man subjects performing the same task), and 
significantly better than the traditional edge 
counting approach (r = 0.66).

1 Introduction 

Evaluating semantic relatedness using network represen- 
tations is a problem with a long history in artificial in- 
telligence and psychology, dating back to the spreading 
activation approach of Quillian [1968] and Collins and
Loftus [1975]. Semantic similarity represents a special
case of semantic relatedness: for example, cars and gaso- 
line would seem to be more closely related than, say, cars 
and bicycles, but the latter pair are certainly more sim-
ilar. Rada et al. [1989] suggest that the assessment of
similarity in semantic networks can in fact be thought of 
as involving just taxonomic (is-a) links, to the exclusion 
of other link types; that view will also be taken here, 
although admittedly it excludes some potentially useful 
information. 

A natural way to evaluate semantic similarity in a tax- 
onomy is to evaluate the distance between the nodes cor- 
responding to the items being compared — the shorter 
the path from one node to another, the more similar they 
are. Given multiple paths, one takes the length of the 
shortest one [Lee et al., 1993; Rada and Bicknell, 1989;
Rada et al., 1989].



A widely acknowledged problem with this approach, 
however, is that it relies on the notion that links in 
the taxonomy represent uniform distances. Unfortu- 
nately, this is difficult to define, much less to control. 
In real taxonomies, there is wide variability in the "dis- 
tance" covered by a single taxonomic link, particularly 
when certain sub-taxonomies (e.g. biological categories)
are much denser than others. For example, in WordNet
[Miller, 1990], a broad-coverage semantic network
for English constructed by George Miller and colleagues
at Princeton, it is not at all difficult to find links that
cover an intuitively narrow distance (RABBIT EARS IS-A
TELEVISION ANTENNA) or an intuitively wide one
(PHYTOPLANKTON IS-A LIVING THING). The same kinds of
examples can be found in the Collins COBUILD Dictionary
[Sinclair (ed.), 1987], which identifies superordinate
terms for many words (e.g. SAFETY VALVE IS-A VALVE
seems a lot narrower than KNITTING MACHINE IS-A
MACHINE).

In this paper, I describe an alternative way to evaluate 
semantic similarity in a taxonomy, based on the notion 
of information content. Like the edge counting method, 
it is conceptually quite simple. However, it is not sen- 
sitive to the problem of varying link distances. In addi- 
tion, by combining a taxonomic structure with empiri- 
cal probability estimates, it provides a way of adapting 
a static knowledge structure to multiple contexts. Section 2
sets up the probabilistic framework and defines the
measure of semantic similarity in information-theoretic
terms; Section 3 presents an evaluation of the similarity
measure against human similarity judgments, using the
simple edge-counting method as a baseline; and Section 4
discusses related work.

2 Similarity and Information Content 

Let C be the set of concepts in an IS-A taxonomy, permit- 
ting multiple inheritance. Intuitively, one key to the sim- 
ilarity of two concepts is the extent to which they share 
information in common, indicated in an IS-A taxonomy 
by a highly specific concept that subsumes them both. 
The edge counting method captures this indirectly, since 
if the minimal path of IS-A links between two nodes is 
long, that means it is necessary to go high in the tax- 
onomy, to more abstract concepts, in order to find a 
least upper bound. For example, in WordNet, NICKEL
and DIME are both subsumed by COIN, whereas the most
specific superclass that NICKEL and CREDIT CARD share
is MEDIUM OF EXCHANGE.^1 (See Figure 1.)



^1 In a feature-based setting (e.g. [Tversky, 1977]), this
would be reflected by explicit shared features: nickels and



[Figure 1: taxonomy diagram not recoverable from the extracted text. Surviving node labels, apparently drawn from both Figure 1 and Figure 2: MEDIUM OF EXCHANGE, CREDIT CARD, NICKEL, DIME, SUBSTANCE, CRYSTAL, WEALTH, CHEMICAL ELEMENT, METAL, TREASURE, NICKEL', GOLD'.]

Figure 1: Fragment of the WordNet taxonomy. Solid
lines represent IS-A links; dashed lines indicate that some
intervening nodes were omitted to save space.



By associating probabilities with concepts in the tax-
onomy, it is possible to capture the same idea, but avoid-
ing the unreliability of edge distances. Let the taxonomy
be augmented with a function p : C → [0, 1], such that
for any c ∈ C, p(c) is the probability of encountering
an instance of concept c. This implies that p is mono-
tonic as one moves up the taxonomy: if c_1 IS-A c_2, then
p(c_1) ≤ p(c_2). Moreover, if the taxonomy has a unique
top node then its probability is 1.

Following the standard argumentation of information
theory [Ross, 1976], the information content of a con-
cept c can be quantified as negative the log likelihood,
-log p(c). Notice that quantifying information content
in this way makes intuitive sense in this setting: as prob-
ability increases, informativeness decreases, so the more
abstract a concept, the lower its information content.
Moreover, if there is a unique top concept, its informa-
tion content is 0.

This quantitative characterization of information pro-
vides a new way to measure semantic similarity. The
more information two concepts share in common, the
more similar they are, and the information shared by
two concepts is indicated by the information content of
the concepts that subsume them in the taxonomy. For-
mally, define



    \mathrm{sim}(c_1, c_2) = \max_{c \in S(c_1, c_2)} [-\log p(c)]    (1)



Figure 2: Another fragment of the WordNet taxonomy 

ity, rather than concept similarity. Using s(w) to repre-
sent the set of concepts in the taxonomy that are senses
of word w, define

    \mathrm{sim}(w_1, w_2) = \max_{c_1, c_2} [\mathrm{sim}(c_1, c_2)],    (2)

where c_1 ranges over s(w_1) and c_2 ranges over s(w_2).
This is consistent with Rada et al.'s [1989] treatment
of "disjunctive concepts" using edge counting: they de-
fine the distance between two disjunctive sets of con-
cepts as the minimum path length from any element of
the first set to any element of the second. Here, the
word similarity is judged by taking the maximal infor-
mation content over all concepts of which both words
could be an instance. For example, Figure 2 illustrates
how the similarity of words nickel and gold would be
computed: the information content would be computed
for all classes subsuming any pair in the cross product
of {NICKEL, NICKEL'} and {GOLD, GOLD'}, and the infor-
mation content of the most informative class used to
quantify the similarity of the two words.
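
As an illustration of the word-level computation just described, here is a minimal Python sketch. The taxonomy fragment, the root node ENTITY, the sense inventory, and all probabilities are hypothetical values chosen only to be monotonic up the taxonomy; they are not taken from WordNet or from the Brown Corpus.

```python
import math
from itertools import product

# Hypothetical fragment: concept -> parents (IS-A links, multiple inheritance allowed).
PARENTS = {
    "NICKEL": ["COIN"], "COIN": ["MEDIUM OF EXCHANGE"],
    "GOLD": ["WEALTH"],
    "NICKEL'": ["METAL", "CHEMICAL ELEMENT"],
    "GOLD'": ["METAL", "CHEMICAL ELEMENT"],
    "METAL": ["SUBSTANCE"], "CHEMICAL ELEMENT": ["SUBSTANCE"],
    "SUBSTANCE": ["ENTITY"], "WEALTH": ["ENTITY"],
    "MEDIUM OF EXCHANGE": ["ENTITY"], "ENTITY": [],  # ENTITY is an assumed top node
}
# Made-up concept probabilities (not corpus estimates).
P = {
    "NICKEL": 0.004, "COIN": 0.01, "MEDIUM OF EXCHANGE": 0.05,
    "GOLD": 0.005, "WEALTH": 0.03, "NICKEL'": 0.002, "GOLD'": 0.003,
    "CHEMICAL ELEMENT": 0.01, "METAL": 0.02, "SUBSTANCE": 0.2, "ENTITY": 1.0,
}
# Hypothetical sense inventory s(w).
SENSES = {"nickel": ["NICKEL", "NICKEL'"], "gold": ["GOLD", "GOLD'"]}

def subsumers(concept):
    """Return the concept plus every concept reachable through IS-A links."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def concept_sim(c1, c2):
    """Equation (1): (information content, class) of the most informative shared class."""
    shared = subsumers(c1) & subsumers(c2)
    return max((-math.log(P[c]), c) for c in shared)

def word_sim(w1, w2):
    """Equation (2): maximize concept similarity over the cross product of senses."""
    return max(concept_sim(c1, c2) for c1, c2 in product(SENSES[w1], SENSES[w2]))

score, best_class = word_sim("nickel", "gold")
```

With these invented numbers, METAL and CHEMICAL ELEMENT are structurally equivalent upper bounds of NICKEL' and GOLD', and the class with the lower probability is the one selected.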

3 Evaluation 

3.1 Implementation 

The work reported here used WordNet's (50,000-node)
taxonomy of concepts represented by nouns (and com-
pound nominals) in English.^2 Frequencies of concepts
in the taxonomy were estimated using noun frequencies
from the Brown Corpus of American English [Francis



where S(c_1, c_2) is the set of concepts that subsume both
c_1 and c_2. Notice that although similarity is computed
by considering all upper bounds for the two concepts,
the information measure has the effect of identifying
minimal upper bounds, since no class is less informa-
tive than its superordinates. For example, in Figure 1,
COIN, CASH, etc. are all members of S(NICKEL, DIME),
but the concept that is structurally the minimal upper
bound, COIN, will also be the most informative. This
can make a difference in cases of multiple inheritance; for
example, in Figure 2, METAL and CHEMICAL ELEMENT
are not structurally distinguishable as upper bounds of
NICKEL' and GOLD', but their information content may
in fact be quite different.
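
The concept-level measure can be sketched in a few lines of Python. This is an illustration under stated assumptions: the taxonomy fragment below is a hypothetical simplification using node names from the text, the probabilities are invented, and a concept is assumed to count as one of its own subsumers.

```python
import math

# Hypothetical toy fragment: concept -> parents (IS-A links).
PARENTS = {
    "NICKEL": ["COIN"],
    "DIME": ["COIN"],
    "COIN": ["CASH"],
    "CASH": ["MEDIUM OF EXCHANGE"],
    "CREDIT CARD": ["MEDIUM OF EXCHANGE"],
    "MEDIUM OF EXCHANGE": [],
}

# Made-up probabilities, chosen only to be monotonic up the taxonomy (top node = 1).
P = {
    "NICKEL": 0.01, "DIME": 0.01, "COIN": 0.05, "CASH": 0.10,
    "CREDIT CARD": 0.02, "MEDIUM OF EXCHANGE": 1.0,
}

def subsumers(concept):
    """Return the concept plus every concept reachable through IS-A links."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sim(c1, c2):
    """Equation (1): return (max information content, class achieving it)."""
    shared = subsumers(c1) & subsumers(c2)
    best = min(shared, key=lambda c: P[c])  # lowest probability = highest -log p
    return -math.log(P[best]), best

coin_score, coin_class = sim("NICKEL", "DIME")
top_score, top_class = sim("NICKEL", "CREDIT CARD")
```

Although every shared upper bound is considered, the class returned for NICKEL and DIME is COIN, the structurally minimal one, because probability can only grow on the way up.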

In practice, one often needs to measure word similar- 



and Kucera, 1982], a large (1,000,000 word) collection
of text across genres ranging from news articles to sci-
ence fiction. Each noun that occurred in the corpus was
counted as an occurrence of each taxonomic class con-
taining it.^3 For example, in Figure 1, an occurrence of
the noun dime would be counted toward the frequency
of DIME, COIN, and so forth. Formally,



    \mathrm{freq}(c) = \sum_{n \in \mathrm{words}(c)} \mathrm{count}(n),    (3)



where words(c) is the set of words subsumed by concept 
c. Concept probabilities were computed simply as rela- 
tive frequency: 

    p(c) = \frac{\mathrm{freq}(c)}{N},    (4)



dimes are both small, round, metallic, and so on. These fea-
tures are captured implicitly by the taxonomy in categorizing
NICKEL and DIME as subordinates of COIN.



^2 Concept as used here refers to what Miller et al. [1990]
call a synset, essentially a node in the taxonomy.

^3 Plural nouns counted as instances of their singular forms.



where N was the total number of nouns observed (ex- 
cluding those not subsumed by any WordNet class, of 
course). 
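
The counting scheme of equations (3) and (4) can be sketched as follows. The tiny taxonomy, the sense inventory, and the noun counts are hypothetical; the sketch assumes each noun occurrence is credited once to every class that subsumes any of its senses, and that nouns outside the taxonomy are left out of N.

```python
from collections import Counter

# Hypothetical fragment: concept -> parents, and noun -> concepts it is a sense of.
PARENTS = {"DIME": ["COIN"], "NICKEL": ["COIN"], "COIN": ["CASH"], "CASH": []}
SENSES = {"dime": ["DIME"], "nickel": ["NICKEL"], "coin": ["COIN"], "cash": ["CASH"]}

def subsumers(concept):
    """Return the concept plus every concept reachable through IS-A links."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def concept_probabilities(noun_counts):
    """Equations (3) and (4): propagate noun counts upward, then divide by N."""
    freq, total = Counter(), 0
    for noun, count in noun_counts.items():
        if noun not in SENSES:
            continue  # nouns not subsumed by any class are excluded from N
        total += count
        classes = set().union(*(subsumers(s) for s in SENSES[noun]))
        for c in classes:
            freq[c] += count
    return {c: f / total for c, f in freq.items()}

# Invented counts; "unicorn" stands in for a noun missing from the taxonomy.
probs = concept_probabilities(
    {"dime": 3, "nickel": 1, "coin": 4, "cash": 2, "unicorn": 10}
)
```

Because counts only accumulate on the way up, the resulting probabilities are monotonic and the top class of this fragment receives probability 1.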

3.2 Task 

Although there is no standard way to evaluate compu- 
tational measures of semantic similarity, one reasonable 
way to judge would seem to be agreement with human 
similarity ratings. This can be assessed by using a com- 
putational similarity measure to rate the similarity of a 
set of word pairs, and looking at how well its ratings 
correlate with human ratings of the same pairs.
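
The agreement statistic used throughout the evaluation is the correlation coefficient r. A minimal sketch of the standard Pearson product-moment formula follows, assuming that is the statistic intended; the two rating lists are invented placeholders, not data from any of the studies discussed here.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)

# Placeholder ratings for four word pairs (illustrative values only).
human_means = [3.9, 3.5, 1.2, 0.4]
model_scores = [8.0, 7.1, 3.0, 0.5]
r = pearson_r(human_means, model_scores)
```

Note that r is unchanged in magnitude when one of the two sequences is negated or shifted by a constant, which matters when a distance is converted into a similarity.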

An experiment by Miller and Charles [1991] provided
appropriate human subject data for the task. In their
study, 38 undergraduate subjects were given 30 pairs of
nouns that were chosen to cover high, intermediate, and
low levels of similarity (as determined using a previous
study [Rubenstein and Goodenough, 1965]), and asked
to rate "similarity of meaning" for each pair on a scale
from 0 (no similarity) to 4 (perfect synonymy). The av-
erage rating for each pair thus represents a good estimate
of how similar the two words are, according to human
judgments.

In order to get a baseline for comparison, I replicated 
Miller and Charles's experiment, giving ten subjects the 
same 30 noun pairs. The subjects were all computer 
science graduate students or postdocs at the University 
of Pennsylvania, and the instructions were exactly the 
same as used by Miller and Charles, the main difference 
being that in this replication the subjects completed the 
questionnaire by electronic mail (though they were in- 
structed to complete the whole thing in a single uninter- 
rupted sitting). Five subjects received the list of word 
pairs in a random order, and the other five received the 
list in the reverse order. The correlation between the 
Miller and Charles mean ratings and the mean ratings 
in my replication was .96, quite close to the .97 corre- 
lation that Miller and Charles obtained between their 
results and the ratings determined by the earlier study. 

For each subject in my replication, I computed how 
well his or her ratings correlated with the Miller and 
Charles ratings. The average correlation over the 10 
subjects was r = 0.8848, with a standard deviation of
0.08.^4 This value represents an upper bound on what
one should expect from a computational attempt to per- 
form the same task. 

For purposes of evaluation, three computational sim- 
ilarity measures were used. The first is the similarity 
measurement using information content proposed in the 
previous section. The second is a variant on the edge 
counting method, converting it from distance to similar- 
ity by subtracting the path length from the maximum 
possible path length: 



    \mathrm{sim}_{\mathrm{edge}}(w_1, w_2) = (2 \times \mathrm{MAX}) - \left[ \min_{c_1, c_2} \mathrm{len}(c_1, c_2) \right]    (5)



where c_1 ranges over s(w_1), c_2 ranges over s(w_2), MAX
is the maximum depth of the taxonomy, and len(c_1, c_2)



Similarity method                 Correlation
Human judgments (replication)     r = .9015
Information content               r = .7911
Probability                       r = .6671
Edge counting                     r = .6645

Table 1: Summary of experimental results.



is the length of the shortest path from c_1 to c_2. (Recall
that s(w) denotes the set of concepts in the taxonomy
that represent senses of word w.) Note that the con-
version from a distance to a similarity can be viewed as
an expository convenience, and does not affect the eval-
uation: although the sign of the correlation coefficient
changes from positive to negative, its magnitude turns
out to be just the same regardless of whether or not the
minimum path length is subtracted from (2 × MAX).
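
For comparison, the edge-counting baseline of equation (5) can be sketched with a breadth-first search over IS-A links taken in both directions. The taxonomy fragment, the sense inventory, and the value of MAX_DEPTH are hypothetical, and path length is assumed to be counted in edges.

```python
from collections import deque
from itertools import product

# Hypothetical toy fragment: concept -> parents (IS-A links).
PARENTS = {
    "NICKEL": ["COIN"],
    "DIME": ["COIN"],
    "COIN": ["CASH"],
    "CASH": ["MEDIUM OF EXCHANGE"],
    "CREDIT CARD": ["MEDIUM OF EXCHANGE"],
    "MEDIUM OF EXCHANGE": [],
}
SENSES = {"nickel": ["NICKEL"], "dime": ["DIME"], "credit card": ["CREDIT CARD"]}
MAX_DEPTH = 3  # depth of the deepest node in this toy fragment (assumed)

def neighbors(concept):
    """Parents and children of a concept, so paths may go up and then down."""
    children = [c for c, parents in PARENTS.items() if concept in parents]
    return PARENTS[concept] + children

def path_len(c1, c2):
    """Number of IS-A edges on the shortest path between two concepts."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")

def sim_edge(w1, w2):
    """Equation (5): (2 x MAX) minus the shortest path over all sense pairs."""
    shortest = min(path_len(a, b) for a, b in product(SENSES[w1], SENSES[w2]))
    return 2 * MAX_DEPTH - shortest

near = sim_edge("nickel", "dime")
far = sim_edge("nickel", "credit card")
```

Every link contributes exactly one unit here, which is the uniform-distance assumption that the information-based measure is meant to avoid.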

The third point of comparison is a measure that sim- 
ply uses the probability of a concept, rather than the 
information content: 



    \mathrm{sim}_{p(c)}(c_1, c_2) = \max_{c \in S(c_1, c_2)} [1 - p(c)]    (6)
    \mathrm{sim}_{p(c)}(w_1, w_2) = \max_{c_1, c_2} [\mathrm{sim}_{p(c)}(c_1, c_2)]    (7)



^4 Inter-subject correlation in the replication, estimated us-
ing leaving-one-out resampling [Weiss and Kulikowski, 1991],
was r = .9026, stdev = 0.07.



where c_1 ranges over s(w_1) and c_2 ranges over s(w_2)
in (7). Again, the difference between maximizing 1 - p(c)
and minimizing p(c) turns out not to affect the magni-
tude of the correlation. It simply ensures that the value
can be interpreted as a similarity value, with high values
indicating similar words.

3.3 Results 

Table 1 summarizes the experimental results, giving the
correlation between the similarity ratings and the mean
ratings reported by Miller and Charles. Note that, owing
to a noun missing from the WordNet taxonomy, it was
only possible to obtain computational similarity ratings
for 28 of the 30 noun pairs; hence the proper point of
comparison for human judgments is not the correlation
over all 30 items (r = .8848), but rather the correlation
over the 28 included pairs (r = .9015). The similarity
ratings by item are given in Table 3.

3.4 Discussion 

The experimental results in the previous section suggest 
that measuring semantic similarity using information 
content provides quite reasonable results, significantly 
better than the traditional method of simply counting 
the number of intervening IS-A links.

The measure is not without its problems, however. 
One problem is that, like simple edge counting, the mea- 
sure sometimes produces spuriously high similarity mea- 
sures for words on the basis of inappropriate word senses. 
For example, Table 2 shows the word similarity for sev-
eral words with tobacco. Tobacco and alcohol are similar,
both being drugs, and tobacco and sugar are less simi- 
lar, though not entirely dissimilar, since both can be 
classified as substances. The problem arises, however, 
in the similarity rating for tobacco with horse: the word 



n1        n2        sim(n1,n2)   class
tobacco   alcohol   7.63         DRUG
tobacco   sugar     3.56         SUBSTANCE
tobacco   horse     8.26         NARCOTIC



the two concepts being compared rather than the prob- 
ability of a subsuming concept. Specifically, they define 



Table 2: Similarity with tobacco computed by maximiz- 
ing information content 



horse can be used as a slang term for heroin, and as 
a result information-based similarity is maximized, and 
path length minimized, when the two words are both 
categorized as narcotics. This is contrary to intuition. 

Cases like this are probably relatively rare. However, 
the example illustrates a more general concern: in mea- 
suring similarity between words, it is really the relation- 
ship among word senses that matters, and a similarity 
measure should be able to take this into account. 

In the absence of a reliable algorithm for choosing the 
appropriate word senses, the most straightforward way 
to do so in the information-based setting is to consider 
all concepts to which both nouns belong rather than 
taking just the single maximally informative class. This 
suggests redefining similarity as follows: 



    \mathrm{sim}(c_1, c_2) = \sum_i \alpha(c_i) [-\log p(c_i)],    (8)



where {c_i} is the set of concepts dominating both c_1
and c_2, as before, and Σ_i α(c_i) = 1. This measure of
similarity takes more information into account than the
previous one: rather than relying on the single concept
with maximum information content, it allows each class
to contribute information content according to the value
of α(c_i). Intuitively, these α values measure relevance:
for example, α(NARCOTIC) might be low in general usage
but high in the context of a newspaper article about drug
dealers. In work on resolving syntactic ambiguity using
semantic information [Resnik, 1993b], I have found that
local syntactic information can be used successfully to
set values for the α.
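
A small sketch of the weighted definition in equation (8) shows how the α values change the outcome. The class probabilities and both weightings below are invented for illustration, and how α would actually be set is left open, as in the text.

```python
import math

def weighted_sim(shared_probs, alpha):
    """Equation (8): sum of alpha(c_i) * -log p(c_i) over the shared classes.

    shared_probs maps each class dominating both concepts to its probability;
    alpha maps the same classes to relevance weights that must sum to 1.
    """
    if abs(sum(alpha[c] for c in shared_probs) - 1.0) > 1e-9:
        raise ValueError("alpha weights must sum to 1 over the shared classes")
    return sum(alpha[c] * -math.log(p) for c, p in shared_probs.items())

# Hypothetical classes shared by two words, with made-up probabilities.
shared = {"NARCOTIC": 0.001, "SUBSTANCE": 0.05}

# Two invented contexts: everyday usage versus a crime report.
general_usage = {"NARCOTIC": 0.1, "SUBSTANCE": 0.9}
crime_report = {"NARCOTIC": 0.9, "SUBSTANCE": 0.1}

low = weighted_sim(shared, general_usage)
high = weighted_sim(shared, crime_report)
```

Putting all of the weight on the single most informative class reduces this to equation (1), so the earlier measure is a special case of the weighted one.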

4 Related Work 

Although the counting of edges in IS-A taxonomies seems 
to be something many people have tried, there seem to be 
few published descriptions of attempts to directly eval-
uate the effectiveness of this method. A number of re-
searchers have attempted to make use of conceptual dis-
tance in information retrieval. For example, Rada et al.
[1989; 1989] and Lee et al. [1993] report experiments
using conceptual distance, implemented using the edge
counting metric, as the basis for ranking documents by
their similarity to a query. Sussna [1993] uses semantic
relatedness measured with WordNet in word sense dis- 
ambiguation, defining a measure of distance that weights 
different types of links and also explicitly takes depth in 
the taxonomy into account. 

The most relevant related work appears in an un-
published manuscript by Leacock and Chodorow [1994].



    \mathrm{sim}_{\mathrm{ndist}}(w_1, w_2) = -\log \left[ \frac{\min_{c_1, c_2} \mathrm{len}(c_1, c_2)}{2 \times \mathrm{MAX}} \right]    (9)



(The notation above is the same as for equation (5).)
In addition to this definition, they also include several
special cases, most notably to avoid infinite similarity
when c_1 and c_2 are exact synonyms and thus have a path
length of 0. Leacock and Chodorow have experimented 
with this measure and the information content measure 
described here in the context of word sense disambigua- 
tion, and found that they yield roughly similar results. 
More significantly, I recently implemented their method 
and tested it on the task reported in the previous section, 
and found that it actually outperforms the information- 
based measure. This led me to do a followup experiment 
using a different and larger set of noun pairs, and in 
the followup study the information-based measure per-
formed better.^5 The relationship between the two algo-
rithms will thus require further study. For now, however,
what seems most significant is that both approaches
take the form of a log-based (and hence information-like)
measure, as originally proposed in [Resnik, 1993a].
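
The normalized path-length measure of equation (9) can be written as a one-line function once the shortest path length is known. This sketch takes the path length and the maximum depth as plain arguments; the special cases mentioned above are not specified in this paper, so a path length of 0 is simply rejected here rather than handled.

```python
import math

def sim_ndist(min_path_len, max_depth):
    """Equation (9): negative log of the path length normalized by 2 x MAX."""
    if max_depth <= 0:
        raise ValueError("max_depth must be positive")
    if min_path_len <= 0:
        # Exact synonyms would give infinite similarity; the special-case
        # treatment is not described here, so this sketch does not guess at it.
        raise ValueError("path length 0 needs a special case not specified here")
    return -math.log(min_path_len / (2 * max_depth))

# Illustrative values only: two path lengths in a taxonomy of depth 3.
close_pair = sim_ndist(2, 3)
distant_pair = sim_ndist(4, 3)
```

As with equation (5), shorter paths give higher similarity, but the logarithm compresses differences among long paths and stretches them among short ones.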

Finally, in the context of current research in compu- 
tational linguistics, the approach to semantic similarity 
taken here can be viewed as a hybrid, combining corpus- 
based statistical methods with knowledge-based taxo- 
nomic information. The use of corpus statistics alone 
in evaluating word similarity — without prior taxonomic 
knowledge — is currently an active area of research in the 
natural language community. This is largely a reaction 
to sparse data problems in training statistical language 
models: it is difficult to come up with an accurate statis- 
tical characterization of the behavior of words that have 
been encountered few times or not at all. Word similarity 
appears to be one promising way to solve the problem: 
the behavior of a word is approximated by smoothing its 
observed behavior together with the behavior of words 
to which it is similar. For example, a speech recognizer 
that has never seen the phrase ate a peach can still con- 
clude that John ate a peach is a reasonable sequence of 
words in English if it has seen other sentences like Mary 
ate a pear and knows that peach and pear have similar 
behavior. 

The literature on corpus-based determination of word 
similarity has recently been growing by leaps and 
bounds, and is too extensive to discuss in detail here 



(for a review, see [Resnik, 1993a]), but most approaches
to the problem share a common assumption: semanti- 
cally similar words have similar distributional behavior 
in a corpus. Using this assumption, it is common to 
treat the words that co-occur near a word as constitut- 
ing features, and to compute word similarity in terms 
of how similar their feature sets are. As in information 
retrieval, the "feature" representation of a word often 



They have defined a measure resembling information 
content, but using the normalized path length between 



^5 In the followup study, I used netnews archives to gather
highly frequent nouns within related topic areas, and then 
selected noun pairings at random, in order to avoid biasing 
the followup study in favor of either algorithm. 



takes the form of a vector, with the similarity computa- 
tion amounting to a computation of distance in a highly 
multidimensional space. Given a distance measure, it 
is not uncommon to derive word classes by hierarchical 
clustering. A difficulty with most distributional meth- 
ods, however, is how the measure of similarity (or dis- 
tance) is to be interpreted. Although word classes result- 
ing from distributional clustering are often described as 
"semantic," they often capture syntactic, pragmatic, or 
stylistic factors as well. 

5 Conclusions 

This paper has presented a new measure of semantic 
similarity in an IS-A taxonomy, based on the notion of 
information content. Experimental evaluation was per- 
formed using a large, independently constructed corpus, 
an independently constructed taxonomy, and previously 
existing human subject data. The results suggest that 
the measure performs encouragingly well (a correlation 
of r = 0.79 with a benchmark set of human similar- 
ity judgments, against an upper bound of r = 0.90 for 
human subjects performing the same task), and signif- 
icantly better than the traditional edge counting ap- 
proach (r = 0.66). 

In ongoing work, I am currently exploring the appli-
cation of taxonomically-based semantic similarity in the
disambiguation of word senses [Resnik, 1995]. The idea
behind the approach is that when polysemous words ap- 
pear together, the appropriate word senses to assign are 
often those that share elements of meaning. Thus doctor 
can refer to either a Ph.D. or an M.D., and nurse can 
signify either a health professional or someone who takes 
care of small children; but when doctor and nurse are 
seen together, the Ph.D. sense and the childcare sense go
by the wayside. In a widely known paper, Lesk [1986] ex-
ploits dictionary definitions to identify shared elements
of meaning: for example, in the Collins COBUILD
Dictionary [Sinclair (ed.), 1987], the word ill can be
found in the definitions of the correct senses. More re-
cently, Sussna [1993] has explored using similarity of
word senses based on WordNet for the same purpose. 
The work I am pursuing is similar in spirit to Sussna's 
approach, although the disambiguation algorithm and 
the similarity measure differ substantially. 
(row truncated)                            0.0000         0.0000
lad        wizard     0.42   0.7   2.9683   26   0.8722
chord      smile      0.13   0.1   2.3544   20   0.8044
glass      magician   0.11   0.1   1.0105   22   0.5036
noon       string     0.08   0.0   0.0000        0.0000
rooster    voyage     0.08   0.0   0.0000        0.0000



Table 3: Semantic similarity by item. 



